@maleksan85 commented Oct 16, 2025

GPT OSS; GEMM m and n shapes to check: ROCm@bcc4e69

```shell
HIP_VISIBLE_DEVICES=7 \
HSA_NO_SCRATCH_RECLAIM=1 \
NCCL_MIN_NCHANNELS=112 \
USE_FASTSAFETENSOR=1 \
SAFETENSORS_FAST_GPU=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
vllm serve /data/models/openai/gpt-oss-120b \
    --host localhost \
    --port 30000 \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 2048 \
    --swap-space 16 \
    --block-size 64 \
    --async-scheduling \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'
```
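As a side note (a sketch, not part of the PR): the inlined `--compilation-config` JSON is easy to break with shell quoting, so building it in Python and dumping it with `json.dumps` is one way to generate the flag value safely.

```python
import json

# Same compilation config as passed to `vllm serve` above; constructing it
# as a dict and serializing avoids hand-editing nested quotes in the shell.
compilation_config = {
    "pass_config": {
        "enable_attn_fusion": True,
        "enable_noop": True,
        "enable_fusion": True,
    },
    "cudagraph_mode": "FULL",
    "custom_ops": ["+rms_norm", "+silu_and_mul", "+quant_fp8"],
    "splitting_ops": [],
}
# Paste the printed string into --compilation-config='...'
print(json.dumps(compilation_config))
```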
```shell
vllm bench serve \
  --host localhost \
  --port 30000 \
  --model /data/models/openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --random-prefix-len 0 \
  --request-rate "inf" \
  --max-concurrency 64 \
  --num-prompts 640 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el
```
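A quick sanity check (a sketch, not part of the PR) that the token totals and throughput reported below follow from this benchmark command:

```python
# Inputs taken straight from the bench command and the first run's report.
num_prompts = 640
input_len = output_len = 1024      # --random-input-len / --random-output-len
duration_s = 132.96                # reported "Benchmark duration (s)"

total_input = num_prompts * input_len       # matches "Total input tokens: 655360"
total_output = num_prompts * output_len     # --ignore-eos forces full output length
output_tps = total_output / duration_s      # ~4929 tok/s, matching the report
print(total_input, total_output, round(output_tps, 2))
```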
```shell
HIP_VISIBLE_DEVICES=7 \
HSA_NO_SCRATCH_RECLAIM=1 \
NCCL_MIN_NCHANNELS=112 \
USE_FASTSAFETENSOR=1 \
SAFETENSORS_FAST_GPU=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_AITER_UNIFIED_ATTENTION=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
python /data/vllm-scripts/llm_test.py \
    --model /data/models/openai/gpt-oss-120b \
    --dataset-path /data/models/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --batch-size 4 \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 2048 \
    --swap-space 16 \
    --block-size 64 \
    --async-scheduling \
    --no-enable-prefix-caching \
    --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'
```

With the change:

```text
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  132.96
Total input tokens:                      655360
Total generated tokens:                  655360
Request throughput (req/s):              4.81
Output token throughput (tok/s):         4929.13
Peak output token throughput (tok/s):    5696.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          9858.26
---------------Time to First Token----------------
Mean TTFT (ms):                          400.59
Median TTFT (ms):                        352.67
P99 TTFT (ms):                           1099.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.60
Median TPOT (ms):                        12.57
P99 TPOT (ms):                           13.53
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.60
Median ITL (ms):                         11.89
P99 ITL (ms):                            13.90
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13290.38
Median E2EL (ms):                        13212.19
P99 E2EL (ms):                           14837.24
==================================================
```

Without the change:

```text
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  134.69
Total input tokens:                      655360
Total generated tokens:                  655360
Request throughput (req/s):              4.75
Output token throughput (tok/s):         4865.53
Peak output token throughput (tok/s):    5632.00
Peak concurrent requests:                128.00
Total Token throughput (tok/s):          9731.05
---------------Time to First Token----------------
Mean TTFT (ms):                          396.52
Median TTFT (ms):                        387.08
P99 TTFT (ms):                           966.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.77
Median TPOT (ms):                        12.76
P99 TPOT (ms):                           13.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.77
Median ITL (ms):                         12.09
P99 ITL (ms):                            14.52
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13464.16
Median E2EL (ms):                        13493.87
P99 E2EL (ms):                           14258.33
==================================================
```
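The end-to-end delta between the two runs above can be summarized with a quick calculation (a sketch over the reported figures, not part of the PR):

```python
# Output throughput and mean TPOT taken from the two result tables above.
with_tps, without_tps = 4929.13, 4865.53    # tok/s, with vs. without change
with_tpot, without_tpot = 12.60, 12.77      # ms per output token

tps_gain = with_tps / without_tps - 1       # ~+1.3% output throughput
tpot_drop = 1 - with_tpot / without_tpot    # ~-1.3% mean TPOT
print(f"+{tps_gain:.2%} tok/s, -{tpot_drop:.2%} TPOT")
```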
With Triton `gemm_a16w16`:

```text
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                   _gemm_a16_w16_kernel         0.00%       0.000us         0.00%       0.000us       0.000us        2.492s        12.20%        2.492s      15.687us        158877
```

Without Triton GEMM (hipBLASLt kernels):

```text
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT32x64...         0.00%       0.000us         0.00%       0.000us       0.000us        1.415s         6.82%        1.415s      26.839us         52704
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT16x64...         0.00%       0.000us         0.00%       0.000us       0.000us     829.718ms         4.00%     829.718ms      15.754us         52668
Cijk_Alik_Bljk_BBS_BH_Bias_HA_S_SAV_UserArgs_MT16x16...         0.00%       0.000us         0.00%       0.000us       0.000us     746.112ms         3.60%     746.112ms      14.147us         52740
```

~1.2x speedup over hipBLASLt (about 2.99 s vs. 2.49 s total GEMM CUDA time).
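The 1.2x figure follows directly from the profiler tables above (a sketch of the arithmetic, not part of the PR):

```python
# Total Self CUDA time of the three Cijk_* hipBLASLt kernels vs. the single
# Triton gemm_a16w16 kernel; call counts are comparable (~158k each).
hipblaslt_s = 1.415 + 0.829718 + 0.746112   # ~2.99 s total
triton_s = 2.492
speedup = hipblaslt_s / triton_s
print(f"{speedup:.2f}x")
```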

```shell
lm_eval --model vllm --model_args pretrained=/data/models/openai/gpt-oss-120b,tensor_parallel_size=1,max_gen_toks=2048 --tasks gsm8k --batch_size auto --num_fewshot 5 --limit 250 --apply_chat_template
```

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@mergify bot added the rocm (Related to AMD ROCm) label Oct 16, 2025
Aleksandr Malyshev added 4 commits October 20, 2025 21:54
@gshtras added the ready label Oct 28, 2025
Labels: frontend, gpt-oss, ready, rocm, tool-calling